.s HAND/EYE SYSTEMS
\\pers Thomas O. Binford, Lynn H. Quam, David Grossman,
Robert C. Bolles, Donald Gennery, Raphael Finkel, Kicha
Ganapathy, Hans P. Moravec, Russell H. Taylor, Victor D. Scheinman,
Yefim Schukin, Bruce E. Shimano, Kurt Widdoes.
The main scientific goal of the hand/eye project is to understand
those facilities of machine intelligence which involve interaction
with the real world through perceptual and motor functions. Our aim
is to design, build, and test machines with perception and
manipulation.
Perception and motor functions are intelligent
functions in that they require many of the same mechanisms and
representations as other areas of intelligence. The primary problem
in all these areas is to bring to bear knowledge at all levels in the
system, through a world model. This is done by representation of
specific domains in detail, and description mechanisms dictated by
the representations. Goal-directed visual systems are a current
topic of great interest. They require a planning or strategy
component, whose power comes from use of knowledge and models
specific to vision.
Perception and motor control are unique in
having an immediate need for representing shape and geometry,
geometrical operations, and representing control structures for
geometry. All areas of artificial intelligence converge on the use of
representation of various aspects of world knowledge. Natural
language will eventually require representations of geometry for
everyday notions of spatial relations, physics and the physical
world. Automatic programming will require geometrical
representations as soon as it begins to deal with geometry, or the
physical world.
As in speech understanding, descriptive mechanisms
of vision function in a world with noise and uncertainty, and we must
define the similarity of symbolic descriptions at various levels of
detail, perturbed by noise and distorted by transformations like
perspective. Speech, however, is a set of conventions designed to be
understood, in spite of noise. The world did not design itself to be
seen.
.oi
Our effort will concentrate on (1) how to represent visual quantities
and operations in machines, and (2) how they are used to program
classes of tasks, not just isolated tasks. We have not made any
mention of what part actual sensors and manipulators play in this
research. We need to test our ideas and algorithms by experiments in
order to develop adequate representations and theories. We can do a
lot of useful modeling on an <a priori> basis. However, we don't
expect to model the entire world. It is often more difficult to
model the world than to use the world as a model. Particularly for
our applications, we need carefully chosen real world
experimentation.
The application goals of the group are to design and build
programming systems for vision and manipulation, for use in
industrial assembly, inspection, assembly and maintenance in
hazardous environments, handling toxic materials, and vehicle
navigation. Our intent in assembly is to make it possible for
production engineers without expertise in computer science to set up
and program assemblies in a relatively short time. That ability is
particularly important for products made in short production runs
with frequent changes, e.g. for airframes or replacement parts. Spares
could be produced on demand by stockpiling tapes instead of hardware.
For batch production, the facilities of the assembly system must be
more powerful than for high volume production. Since much less
design effort can go into special jigs and fixtures, the system must
be much more versatile to justify the investment, by paying off over
several product runs. Vision can be important for batch production
where versatility and easy interfacing are important. Facilities for
setup which use vision, such as simplified calibration and
self-calibration, have economic importance for short production runs.
The system is intended to make use of design data from computer-aided
design. These techniques are also useful in assembly and maintenance
in hazardous environments, even as aids to teleoperator devices.
Related techniques might
be applied to guidance of remotely piloted aircraft in hostile
environments. This approach could be regarded as an intelligent
man-machine interface. The system which we build up for the domain
of assembly contains the basis for other application domains.
We have carried out specific tasks illustrating general issues in
building research systems which are models for practical systems for
industrial assembly. We have found that this strategy defines sharp
research questions. These issues are: representations of physical
constraints and solutions which allow the system to keep track of the
envelope of possible object positions; representations which allow
obstacle avoidance to be carried out in planning and execution;
representing visual features and operators and expressing these
representations as a vision language; coarse descriptions abstracted
from object descriptions to be used in recognition to select a
subclass of similar objects to match against.
Our system merits support because of the integration of our
manipulation work with vision, because of the planning model and very
high level language system, because of our advanced object
representation facilities, and because of the extent of our progress
in the past in complete systems for manipulation. Our subgoals are
as follows:
.bs;ib; preface 1;fac;
(1) To design and build a programming system for manipulation, implement
touch and force sensors,
implement control algorithms for touch, force control and
cooperating manipulators. As a test of that system, to assemble a two
stroke gasoline engine. Although we have done major subassemblies of
this task, we think the problem is still difficult enough to test our new
sensing and control facilities and representation capabilities; in
particular, we will couple in visual control. We are still in the
midst of system implementation and we are some months away from task
execution. At that time we may choose another task.
(2) To provide a very high level language system with a knowledge
base for programming the manipulator system and to apply that
knowledge base to translate programs written in terms of
assembly-oriented primitives into runnable manipulator programs.
This is a prime scientific goal because of the focus it provides for
studying basic representational issues in robotics, and it has great
potential value for increasing the cost-effectiveness of manipulator
programming. These aspects are discussed more fully below.
(3) To provide a vision language and a visual feedback system which
can be programmed easily to locate features in scenes where a map is
available. To apply the system to simplify programming the
determination of positioning error for real parts being assembled
(screw in hole, sleeve over shaft, aligning two surfaces). The
resulting program must execute in about 1 second. Objects may be
curved, shiny, dirty, with no special painting or preparation. We
assume variation of less than 1 cm. in part location, and only small
rotations. The system is intended to be programmed by people without
experience in computer vision. A planning program will interface
between the user's model and detailed vision programs.
.end
.ss Manipulation
%2A man can do a lot blind, but much less with one arm, only two
fingers, and poor sense of force and touch. We have reached the
limit of what we can do with crude touch and force sensing.%1
.cb Accomplishments
We have built up a position of leadership in computer-controlled
manipulation by a coherent program of software, hardware and
experimentation: We have the most complete and integrated
computer-controlled manipulation system. It is currently as simple,
and in many cases simpler, to program our computer-controlled
manipulators than to perform the same tasks with conventional
manipulators in "record and playback" mode. We designed and built
high quality manipulators [Scheinman], duplicates of which have been
built or bought by JPL, SRI, GM, National Bureau of Standards,
University of Illinois, Purdue University, and Boston University.
Scheinman designed a small scale version of the arm while he was
visiting at MIT. A company has been formed which has 8 of these arms
completed or
in production, for Texas Instruments, MIT, University of
Illinois, SRI, and Purdue University.
We were first to automatically
generate trajectory control with software servo [Paul]. Others used
point to point motion with hardware servo. Our trajectory planning
had the advantage that it was easy to generate smooth motions for
complete actions. For comparison, it was easier for us to program
the trajectories in our assemblies by "learning by doing" than
similar assemblies done with Unimates in "record and playback" mode.
The arm servo contained several new features: a predictive Newtonian
dynamic model of the arm, including inertia and gravity forces;
feedback as a critically damped linear system; trajectory
modification which allows self-adaptation of planned trajectories to
accommodate variations in position and contingencies. These were
embedded in a manipulation system called WAVE, an interpretive hand
language, split into a small arm servo program written in PDP-6
assembly language, and a trajectory planning program written in SAIL
for the PDP-10. WAVE was not intended to be exportable, but SAIL is
available on PDP-10 systems.
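The structure of such a servo can be suggested in modern notation.
The following Python sketch is our reconstruction for illustration,
not the WAVE servo itself; the single-joint dynamics and gains are
invented placeholders. It shows the predictive Newtonian feedforward
term combined with critically damped position feedback.
.bc;verbatim;
# Sketch of a computed-torque joint servo with critically damped
# feedback. A reconstruction for illustration, not the original WAVE
# code; the one-link inertia and gravity model are invented.

import math

def servo_torque(q, qd, q_ref, qd_ref, qdd_ref, inertia, gravity, wn=10.0):
    # Critical damping ties the two gains together: kp = wn^2, kv = 2*wn,
    # so the error dynamics have a double pole at -wn.
    kp = wn * wn
    kv = 2.0 * wn
    # desired acceleration = planned acceleration + feedback correction
    a = qdd_ref + kv * (qd_ref - qd) + kp * (q_ref - q)
    # predictive dynamic model: inertia and gravity forces
    return inertia * a + gravity(q)

# toy single-link example: gravity torque of a link of mass m, length l
def link_gravity(q, m=1.0, l=0.3, g=9.8):
    return m * g * l * math.cos(q)

tau = servo_torque(q=0.5, qd=0.0, q_ref=0.6, qd_ref=0.1, qdd_ref=0.0,
                   inertia=0.2, gravity=link_gravity)
print("commanded torque:", tau)
.end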
The planning and servo techniques apply
to the class of arms for which analytic or numerical
solutions exist.
At present, solutions do not exist for arms with redundant degrees of freedom.
Use of the language WAVE made possible a degree of
generality; we evolved a library of macros for assembly which made
successive tasks increasingly quick to program.
We were the first
group to perform computer-controlled assembly, as a series of planned
experiments to test control facilities. An automobile water pump,
piston-crank subassembly and clutch subassembly of a two stroke
gasoline engine, and tool changing were programmed.
At about that time, the University of Edinburgh programmed assembly of
a toy car [Ambler].
Since that time,
Kawasaki and Unimation performed assemblies using
"record and playback" mode. IBM has assembled a toy and a complex
subassembly of a typewriter, which is probably the most complicated
computer-controlled assembly. Hitachi has made a special purpose
assembly device using force feedback [Goto] which is in production.
Inoue at MIT has performed an assembly using force feedback [Inoue].
We have used crude force feedback from measurement of motor torques,
crude touch sensing, and searching as follows: to increase the
tolerance of assemblies; to require only crude position and alignment
information and self-correct the estimates (to simplify setup and
programming); to allow contingency action; to continuously monitor
positions and correct for drifts in calibration. These
self-calibration and self-alignment facilities are shown in a film
we have produced [3].
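The flavor of one such facility, a guarded move that self-corrects a
position estimate from crude force sensing, is suggested by the Python
sketch below; it is an illustration only, with an invented simulated
sensor, not the WAVE macro library.
.bc;verbatim;
# Illustrative guarded move: advance until the measured force shows
# contact, then return the contact position so the planned estimate can
# be corrected. The simulated one-axis world is invented for the demo.

def guarded_move(move_down, read_force, step=0.05, limit=5.0,
                 threshold=500.0):
    # Step downward until force (in grams) exceeds threshold.
    traveled = 0.0
    while traveled < limit:
        if read_force() > threshold:
            return traveled      # contact: correct position estimate here
        move_down(step)
        traveled += step
    raise RuntimeError("no contact within travel limit")

# fake hardware: a surface 1.2 cm below the starting hand position
state = {"z": 0.0}
surface = 1.2
def move_down(dz): state["z"] += dz
def read_force(): return 900.0 if state["z"] >= surface else 0.0

print(guarded_move(move_down, read_force))   # about 1.2
.end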
We have programmed
synchronized, non-simultaneous manipulators; we have recently found
that a Japanese group [Nakano] programmed simultaneous coordinated
motion of two manipulators at about the same time. We have
implemented a force balance with a sensitivity of about 25 grams, an
order of magnitude better than current force sensing. We have an
experimental touch sensor with 1 gm sensitivity, an order of
magnitude better than the current touch sensor. Both these sensors
need to be extensively evaluated and interfaced.
.cb Language for Programmable Assembly
Paul, Finkel, Taylor, and Bolles have designed a language for
programmable assembly, AL [Finkel], to succeed WAVE. AL is intended
as a research vehicle and is designed to be modified. It is
written in SAIL and is fairly transportable.
We believe that any language should
be completely implemented in a public version after it is developed.
The manipulation portion of AL will be fixed and final by July 1977.
The planning portion of AL will be well-developed at that stage, but
developments in strategy and planning systems will be desirable to
include in AL as they occur.
WAVE was a language at the assembly
language level. AL has a variety of structures which would be quite
difficult to put into WAVE. AL has an ALGOL-like control structure
to allow structured programs. Multiple processes are provided for
simultaneous control of several devices. Interrupts are implemented
by ON-monitors. Trajectories are specified with greater flexibility,
and more versatile force control will be included.
A planning system is included, as described in detail below.
Briefly, it provides for specifying planning values of positions and
attachment relations, and keeps track of planning values as control
passes through control structures. The language is potentially
useful for a broad class of devices by including special solution
programs for each device. Currently, the class of manipulators supported
is limited to devices with non-redundant degrees of freedom. It probably is not
difficult to provide solution programs for devices with redundant
degrees of freedom, by specifying constraints on the motion.
.cb Sensing for Manipulation
We have found no force and touch sensors currently available which are
adequate for our purposes. We have made extensive surveys [Binford]
and publicized our requirements to interest commercial development of
sensors, and to promote cooperative development. If there were any
which were roughly adequate, we would use them in preference to
developing our own. If at any time we find such devices, we will
terminate our own efforts at sensor development as soon as possible.
At any rate, we will probably need to develop computer interfaces,
which are not trivial in this case. We need to develop improved
force and touch sensing hardware. We will need to evaluate and
interface sensors which we have developed. If difficulties develop,
it may be necessary to develop alternate touch sensing techniques
which are simpler to interface. We have made a study of a driven
piezoelectric touch sensor which is promising but requires further
development. We will implement force sensors with sensitivity of 20
grams. We will implement touch sensors with sensitivity of 2 grams
on a 2x3 matrix for each finger. Those are state of the art under
the constraints which make them usable on a manipulator.
What will
we be able to do with such sensing that we could not do without? Touch
can be quite sensitive, but is unusable with tools where contact is
at the tool and not at the fingers. Force sensing operates at a
distance. We can extend arm control to delicate and dexterous
operations; to adaptive grasping of irregular objects; to many small
assemblies which are easily damaged; to picking up light objects
which previously would move away; to inserting a sleeve over a shaft
without binding; to exploring with touch; and to inserting screws
into holes by tipping them and feeling when they drop into the hole.
Previously, we have used force sensing based on motor torques with a
sensitivity of 500 grams, and touch sensing based on a single
microswitch per finger with sensitivity of 10 grams. The previous
sensing abilities were crude enough to strongly limit our
manipulation abilities.
.cb Cooperative Control of Manipulators
We will formulate a theoretical basis for force control of
manipulators, determine how sensitive force control can be with our
manipulators [Scheinman], optimize those abilities, provide language
primitives, and implement them. No adequate analysis of force control
of manipulators yet exists. Whitney has written about the
subject, but without a solution.
We have formulated new force control and synchronization
primitives in AL, but we require experimentation to evaluate and
refine them. We will analyze and experiment with
cooperative control of two manipulators, to design and implement
language primitives for AL. Although there is a Japanese paper on
the subject, it is only a beginning and much more remains to be done.
We need coordinated manipulation as a basis for other
manipulation experiments. Typical tasks requiring cooperative
control are installing limp or semi-rigid gaskets, carrying heavy
objects, picking up irregular objects, and carrying liquid in open
containers. We will carry out tasks which require two arms in
cooperation, sensitive touch, and force, from among the tasks in this
and the preceding paragraph, and in integrated assembly tasks.
.cb "`Very High Level' language for automation"
One very important factor in determining how widely the potential
advantages of programmable manipulator systems can be realized is the
ease with which the necessary programming can be done by engineering
personnel who are not necessarily expert computer scientists.
To date, languages for control of manipulators have been very
explicit, requiring:
.bs; preface 1;fac;
(1) very detailed descriptions of the specific motions and
sensor tests to be made.
(2) a great deal of bookkeeping by the user, to keep track of
expected positions of objects and of what calculations must be
made to update position variables as a result of sensory
tests, and the like.
(3) an intimate understanding of the manipulation system.
(4) a fairly high degree of programming sophistication.
.end
AL seeks to minimize, in so far as possible, the burdens that these
characteristics place on the user. For instance, it provides a means
by which one variable may be "affixed" to another, so that if one is
changed then the other will be updated appropriately, thus
substantially reducing the amount of explicit bookkeeping required.
Despite such niceties, writing code at the manipulator control level
still requires a fairly high degree of programming sophistication.
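The affixment mechanism just mentioned can be sketched in a few lines
of Python; this is our illustration of the bookkeeping, not AL's
implementation, and it reduces coordinate frames to scalar positions to
keep the propagation visible.
.bc;verbatim;
# Sketch of "affixing" one variable to another: once affixed, moving the
# parent updates the child automatically. Real AL affixes full coordinate
# frames; scalar positions are a simplification for illustration.

class Frame:
    def __init__(self, pos):
        self.pos = pos
        self.children = []            # (frame, fixed offset) pairs

    def affix(self, other):
        self.children.append((other, other.pos - self.pos))

    def move_to(self, new_pos):
        self.pos = new_pos
        for child, offset in self.children:
            child.move_to(self.pos + offset)   # propagate recursively

pump = Frame(10.0)
bolt = Frame(12.5)
pump.affix(bolt)          # bolt now rides with the pump
pump.move_to(20.0)
print(bolt.pos)           # 22.5, updated with no explicit bookkeeping
.end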
Typically, the gross structure of programs is fairly simple, and may
be described by a sequence or partial ordering of fairly well defined
subtasks. The "fine" structures, however, are somewhat more
complicated. Typically, there may be loops, tests for exception
conditions, specifications for force or tactile feedback, and so
forth.
The point here is that many users do not particularly care about such
details. It would be much easier for them to specify tasks at
somewhat higher levels of abstraction. For instance, an assembly
engineer who wants to put together a small rotary pump should be able
to write something like:
.bc;verbatim;
:
COMMENT This is high-level AL code.;
FIT pumphead ONTO pumphousing
WITH ALIGNMENT
housing.studx IN head.holex
housing.study IN head.holey;
INSERT bolt1 INTO head.hole1
WITH TORQUE = 10*FT*LB
USING TOOL driver;
INSERT bolt2 INTO head.hole2
WITH TORQUE = 10*FT*LB
USING TOOL driver;
INSERT drainplug INTO sidehole;
COMMENT and so forth;
:
.end
and allow the system to fill in the details, rather than coding them
himself. Such a
code sequence can easily be written in a few minutes, whereas
the corresponding manipulator program takes even an "expert"
programmer several hours to write and debug. Furthermore, unless our
hypothetical engineer is an above-average hand-eye programmer, the
system's result is more apt to be what he wants than code he writes himself.
Essentially, the research question posed here is "How can expert
knowledge about hand-eye programming be codified into a system so
that it can be accessible to a non-expert user?"
In addition to having great "applications" importance, we believe
that work in this area provides a useful framework for research on a
number of related problems. These include:
.bs;preface 1;fac;
(1) The representation of information about physical
situations and description of objects in a form conducive to
reasoning about them.
(2) Characterization of how sensory information affects what
you know about a physical situation and how a given fact
affects how accurate your knowledge can be assumed to be.
(3) Codification of knowledge about techniques for
accomplishing particular subtasks. Such knowledge includes
what restrictions must be placed on object locations for a
technique to work, what is accomplished, how much error can be
tolerated, what extra information may be gained as a result of
using it, an outline or program skeleton giving the basic code
required, and so forth.
(4) Understanding of how different parts of a program can
affect each other and of how to "fill in" details for one part
in a manner consistent with the requirements of other parts.
.end
We have chosen small scale mechanical assembly as a good domain for
investigating the incorporation of such specialized knowledge into
AL. There are a number of reasons why this is an attractive choice.
.bs;preface 1;fac;
(1) The situations one encounters in the domain are generally
fairly constrained, thus simplifying somewhat the burden
placed on the modelling and planning systems.
(2) The use of sensory feedback techniques can significantly
reduce the requirements for expensive fixtures. Thus, a
system that helps plan the use of such techniques has a "live"
application.
(3) It is possible to describe interesting and useful tasks in
the assembly domain with a relatively small number of
"primitive" operations.
(4) Most of the underlying mechanisms are general enough to be
transferred to other manipulatory domains.
.end
The initial system will have three basic assembly-oriented primitives
(insertion of shafts and screws into holes, fitting nuts & washers
over shafts, and mating surfaces of two objects according to simple
alignment specifications), together with a small set of "service"
primitives like "pick up" and "place". These suffice to describe a
surprisingly large class of tasks.
One obvious method of providing convenient task description
formalisms is to combine commonly occurring code sequences into
"macro operations" and then allow the user to write programs in terms
of those operations. Unfortunately, however, this solution is
inadequate for the assembly domain. Frequently there are a number of
ways to do a particular subtask. Which is "right" depends very
largely upon what other subtasks must be done. Similarly, it is
frequently possible to perform part of one subtask (or, at least, to
gather useful information) in the course of doing another. Such
considerations are in general very difficult to express within the
paradigm of macro expansion.
We will use progressive refinement to produce consistent and
reasonably efficient programs. Here, the user's initial program
specification is rewritten into successively more detailed versions,
until an executable program written in terms of the "low level"
motion and sensor control statements is produced. The principal
advantage of such a breadth-first approach is that it allows
individual decisions to be made within the context of other tasks.
[Sacerdoti] uses a somewhat similar approach in planning for the SRI
computer-based consultant system. Aside from a number of fairly
substantial differences in how the basic paradigm is implemented,
this work differs from his in the level of detail under consideration
and in that the SRI problem solver is seeking to guide a human
repairman, whereas AL is producing a computer program that runs a
manipulator.
AL's implementation of progressive refinement relies upon
inter-process communication mechanisms to help assure that decisions
are made compatibly. Briefly, this works as follows: knowledge about
the assembly primitives and the various manipulation statements is
encoded into procedures within the system. Then, with each statement
in a program graph, the system associates a process instantiation of
the appropriate procedure. Each process then becomes responsible for
keeping the system's model of its expected effects up to date. The
processes associated with simple motion statements have very little
else to do. The "higher level" operations are responsible for
suggesting further elaborations of themselves (into more detailed
substructures) and for evaluating the effects of changes proposed by
other processes upon their own execution.
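The paradigm can be suggested with a small Python sketch. The expansion
rules below are invented stand-ins for the knowledge procedures
described above; the point is only the breadth-first shape of the
refinement, in which every statement at one level is elaborated before
any is refined further.
.bc;verbatim;
# Sketch of breadth-first progressive refinement: each statement that is
# not yet primitive is expanded by its associated procedure, level by
# level, so decisions are made in the context of the whole current plan.
# The rules here are invented examples, not AL's encoded knowledge.

RULES = {
    "INSERT": lambda arg: [("MOVE-ABOVE", arg), ("GUARDED-MOVE", arg),
                           ("TWIST-IN", arg)],
    "FIT":    lambda arg: [("ALIGN", arg), ("PRESS", arg)],
}

def refine_once(program):
    out, changed = [], False
    for op, arg in program:
        if op in RULES:
            out.extend(RULES[op](arg))
            changed = True
        else:
            out.append((op, arg))      # already a motion-level statement
    return out, changed

plan = [("FIT", "pumphead"), ("INSERT", "bolt1")]
changed = True
while changed:
    plan, changed = refine_once(plan)
print(plan)    # a fully elaborated sequence of motion-level statements
.end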
.cb The planning model
The planning model is one of the central features of the AL design,
and in many ways is analogous to the sort of bookkeeping done by an
algebraic compiler, which must keep track of register assignments,
temporary variables, and similar information. Essentially, the
system uses its "understanding" of the semantics of AL statements to
maintain a data base containing information about each point in the
program graph. This data base is used at all levels of the planning
system. For low level AL, the principal use of the planning model is
to keep track of expected values of variables, especially those used
to hold coordinate frames. These planning values, in turn, are used
in preparing motion trajectories.
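A Python sketch of this bookkeeping, with invented statements and
values, may make the idea concrete; it records the expected value of
each variable at each point in the program graph so the trajectory
planner can read expected frames without executing anything.
.bc;verbatim;
# Sketch of a planning model: a table of expected ("planning") values
# indexed by program point and variable. Statements update the table at
# compile time; trajectory preparation reads from it. Values invented.

planning_model = {}                       # (point, variable) -> value

def record(point, var, value):
    planning_model[(point, var)] = value

def expected(point, var):
    return planning_model[(point, var)]

record(0, "bolt.loc", (12.0, 3.0, 0.0))   # belief before any motion
# planning a MOVE statement at point 1 updates the expected frame:
record(1, "bolt.loc", (12.0, 3.0, 5.0))
# the next motion's trajectory is planned from the expected value:
print(expected(1, "bolt.loc"))
.end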
At higher levels, the planning model provides the essential basis
upon which the system can base decisions on how to translate
task-oriented statements into the appropriate motion primitives.
In this case, the system needs a much better understanding of the
expected state of the "world" at each point in the program.
Frequently, this information is most conveniently specified in terms
of semantic relations between objects. For instance, one can
describe the location of a cup by saying that it is sitting upright
in some region on a table. This information may then be reflected in
a set of mathematical constraints on the location variables of the
cup. Such constraints may then be solved to give an estimate of
possible locations, and a similar analysis can be used to estimate
likely error values. In AL, we will make extensive use of both the
symbolic and the mathematical form of constraint relations in order
to help in forming plans.
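As a concrete illustration, the cup example can be written as interval
constraints in a few lines of Python; the region bounds, table height,
and tolerance below are invented numbers, and real AL constraints
involve full location frames rather than independent intervals.
.bc;verbatim;
# Sketch: the relation "cup sitting upright in a region on a table"
# becomes interval bounds on the cup's location variables, which can be
# solved for an estimate of possible locations and likely errors.

def upright_on_table(region_x, region_y, table_z, tilt_tol=0.02):
    return {
        "x": region_x,                  # somewhere inside the region
        "y": region_y,
        "z": (table_z, table_z),        # resting on the table surface
        "tilt": (-tilt_tol, tilt_tol),  # upright, within tolerance
    }

c = upright_on_table(region_x=(10.0, 30.0), region_y=(5.0, 15.0),
                     table_z=72.0)
estimate = {v: (lo + hi) / 2 for v, (lo, hi) in c.items()}
error = {v: (hi - lo) / 2 for v, (lo, hi) in c.items()}
print(estimate["x"], "+/-", error["x"])     # 20.0 +/- 10.0
.end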
.cb Accomplishments
The high-level language constructs for AL have been designed and used
to describe several sample assembly tasks including the assembly of
a simple water pump, assembly of a metal box, attachment of a bracket
to a beam, and similar tasks. The task-oriented primitives have all
been described in terms of manipulator actions. Object
representations have been developed and debugged, as have procedures
for deriving the accuracy prerequisites of one of the primitives
(insertion of a shaft into a hole) from the parts descriptions.
A preliminary version of the problem solving paradigm was written and
debugged during the early design phases of AL. It generated outline
plans for the pump assembly task from partial orders of high-level
operations, and selected workpiece positions and subtask orderings so
as to minimize superfluous repositioning of the workpiece and
unnecessary returns of a tool to its rack.
The code for maintaining facts in the AL planning model and for
propagating them across control structures has been written and
debugged, as have the procedures for calculating and storing the
planning values used by the "low level" manipulator statements. Beyond
this, the compile-time expression evaluation and conditional
compilation facilities of AL have been substantially debugged.
A system has been developed to automate the translation of semantic
relations between objects into the corresponding mathematical
constraints, and to use the latter to produce range estimates for
location and error variables. The components of this system have
been run independently and the whole thing is being incorporated into
the AL planning model. We expect to use it quite extensively in
planning both for manipulation tasks and for visual feedback.
[Ambler and Popplestone] follow a very similar analysis in
translating relations between objects into mathematical constraints,
which they then solve algebraically. They do not, however, deal with
ranges of values or with error estimates, and generally make less use
of the semantic relations themselves.
.cb Milestones
An initial version of AL will be operating by January
1976. A complete version will be operational by January 1977. That
version will include cooperative motion of two arms, force and touch
sensors, the parser, planning model, and user environment. A force
sensor with 25 grams sensitivity will be operational on an arm by
January 1976. Touch sensors with 2 gm sensitivity in a small matrix
of 2x3 will be operational on one hand by July 1976. We will
complete experiments analyzing texture and shape with touch and
formulate language primitives by January 1977. We will complete the
theoretical analysis, experimentation and formulation of language
primitives for force control by July 1976.
.begin "chart"
.area text lines 4 to 50;
.place text
.next page;
.nofill; select 5;
.narrow 5;
July 1975 to Nov 1976
1975 1976 1976
July Jan July
| | | | | | | | | | | | | | | | |
* | |
---AL prelim version--------→|------------final version------------------------→
----user system------------------------------------------------------→|
*
---planning system----------→|------more primitives,error recovery------------→|
---collision detection------------→|
*
---force sensing on arm----------→|
--analysis of force-----------------------→|
*-touch sensor expts------→|--operating on arm------------→|
-----texture expts------------→|
--expts shape by touch-----→|
*----arm 2----------------→|--cooperating arms expts-------→
ARM INTERFACE
*---planning system----------→|--------3d models-----------→|----stereo--------→|
VISUAL FEEDBACK
-------design---→| VISION LANGUAGE
* -----------implement 2nd version-------------------→|
--------optimal curve--------------------------→|
.end "chart";
.area text lines 4 to 53 IN 2 COLUMNS 5 APART
.ssname←NULL; next page;
.place text
.ss Vision
.cb Overview
We believe that our program addresses most of the major scientific
problems in machine perception:
.bc;
representation of shape
stereo descriptive mechanisms
texture descriptive mechanisms
programming classes of visual problems
using a large visual memory
what can be done with large computational power
goal-directed top-down verification
strategies with shape descriptors.
.end
We pose several questions:
(1) what tasks can be done with current
computing capacity? PDP-10 class machines have about a million times
less computing power than the human visual system (Hubel-Wiesel and
stereo cells). By choosing our problems carefully, we can do a lot
with current computing power; the visual feedback tasks described
below are good examples. And we can investigate algorithms which
might require extensive computation.
(2) What can be done with a million times current computing capacity and
large memory? Special purpose computers well within the state of the
art can gain a factor of 1000. CCD signal processing elements and
revolutionary advances in semi-conductor technology promise further
gains. For example, Noyce of Intel states that the density of gates
on a chip increases by a factor of two each year. If we assume only a
factor of two every two years, in twenty years we would gain the
other factor of 1000, or in 15 years with a factor of 10 increase in
logic speed. If we had such machines, we could make better use of
current techniques which we avoid now because of computation cost.
That would not solve our problems; there would still be many things
we don't know how to do, and we could not make full use of that
processing power. We are evaluating CCD signal processors and
algorithms suitable for that technology. We should not develop that
hardware; it is being developed by others. We can offer some
guidance on what is worthwhile to do in hardware. We should be
prepared to understand algorithms, even those which are prohibitively
costly in computation by current standards.
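The arithmetic behind these projections, spelled out from the figures
above:
.bc;verbatim;
factor of 2 every 2 years, over 20 years:  2↑10 = 1024 ≈ 1000
over 15 years, with 10x faster logic:      2↑7.5 x 10 ≈ 181 x 10 ≈ 1800
either way, combined with the factor of 1000 from special purpose
design, the result approaches the factor of a million estimated above.
.end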
We consider two environments, programmable assembly and road scenes,
because they offer a range of complexity (e.g., shadows, shapes,
motion, etc.) and a certain amount of structure which can be
capitalized upon in order to solve various tasks. Typical visual
feedback tasks within these environments are:
.bs;
(1) visually servo a bearing over a shaft
.begin indent 4,7;
(a) visually locate the shaft (i.e., determine its angle
and the position of the end)
(b) visually locate the bearing with respect to the
end of the shaft
.end
(2) visually check to see if there is a screw on the end of
the screwdriver
(3) locate a bolt that has been accidentally dropped into an engine
casing
(4) inspect the paint job on a car
(5) navigate a vehicle from one point to another along a road.
(6) determine what the object in the middle of the road is
(a box, a dog, a child, etc.).
.end
Our research in visual feedback differs from other work in having
more adequate representation of shape, stronger shape descriptive
mechanisms (curve descriptors) and making much stronger use of shape.
SRI uses a strategy system which finds objects on the basis of very
local properties, such as color and range data. Consider the three
subsystems: acquisition of candidates for the desired object;
verification of detailed match of candidates to the desired object;
automatic plan generation. Our acquisition procedures use shape
information, not just local properties. Similarly, our verification
procedures and strategy generation have better shape descriptors and
better shape representations.
.cb Representation
We have worked extensively on representation for shape of objects.
Our descriptive techniques for complex objects were determined by our
representation [Nevatia, Agin]. We now intend to represent other
data structures, control structures, and strategies for visual
perception. We assume that we can choose a small class of
non-equivalent representations and that they are not a collection of
ad hoc, unrelated structures, but reflect a common basis in 2D and 3D
geometry. We believe that effective automatic generation of
strategies is simplified by clear semantics for primitives and
interfaces.
.oi
A long range goal is to analyze the computational complexity of
structures and operations. The immediate goal is to characterize the
computation cost of primitives and simple control structures such as
searches, to evaluate effective strategies. This informal semantics
is a basis for comparing similar tasks, and programming one task in
analogy with another. It is also a basis for comparing different
programs and evaluating experiments. The semantics define a <vision
language>. We see it as a means of simplifying programming, as a way
of building up a system based on accumulated work, and as a means of
cooperation among various workers.
.cb Accomplishments
Execution: Bolles has written a program which uses a training picture
to characterize a curve (the contrast across it, its distinctness,
etc.) and is then able to locate a point (or segment) from that curve
in an `unknown' picture containing essentially the same curve. We
are quite familiar with other features: correlation [Quam] and
[Hannah], edge operator [Hueckel], region growing [Yakimovsky],
texture [Bajcsy], and contouring [Baumgart].
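The train-then-locate idea can be sketched in a few lines of Python;
this is our illustration, not Bolles' program, and the one-dimensional
profile match below simplifies the real characterization of contrast
and distinctness.
.bc;verbatim;
# Sketch: characterize a curve in a training picture by the intensity
# profile across it, then scan a line of an "unknown" picture for the
# column whose profile matches best. Images are 2-D lists of intensities.

def cross_profile(image, row, col, half=2):
    return [image[row][col + d] for d in range(-half, half + 1)]

def locate(image, row, trained, half=2):
    best_col, best_err = None, float("inf")
    for col in range(half, len(image[row]) - half):
        p = cross_profile(image, row, col, half)
        err = sum((a - b) ** 2 for a, b in zip(p, trained))
        if err < best_err:
            best_col, best_err = col, err
    return best_col

train = [[0] * 5 + [9] * 5 for _ in range(3)]    # edge at column 5
trained = cross_profile(train, 1, 5)
test = [[0] * 7 + [9] * 3 for _ in range(3)]     # same edge, shifted
print(locate(test, 1, trained))                  # finds column 7
.end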
There are also programs available for interactive input of
two-dimensional and three-dimensional models. These are not
complete, but they are useful now. The two-dimensional system
allows a user to point out important features such as correlation
points, curves, and regions. Essentially the system displays a
picture on the screen and the user can `draw' on top of it, marking
important features. The three-dimensional program allows the user to
specify objects composed of parts (possibly unions, intersections, or
subtractions). Each part has a spine along which there are several
cross-sections. A cross-section may be any non-intersecting closed
curve. Thus, a shaft is represented as a straight spine with
circular cross-sections. A cube is represented as a straight spine
with square cross-sections. Curves (such as circles) are approximated
by lines when they are displayed.
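The part representation just described has a natural data-structure
form; a minimal Python sketch (with invented field names, and a
straight spine only) follows.
.bc;verbatim;
# Sketch of the spine-plus-cross-sections representation: a part is a
# spine along which cross-sections are placed; a shaft is a straight
# spine with circular sections, a cube a straight spine with square ones.

from dataclasses import dataclass

@dataclass
class CrossSection:
    shape: str       # "circle", "square", or any non-intersecting curve
    size: float      # radius or half-width
    s: float         # position along the spine

@dataclass
class Part:
    spine_length: float
    sections: list   # CrossSections ordered along the spine

shaft = Part(10.0, [CrossSection("circle", 1.0, 0.0),
                    CrossSection("circle", 1.0, 10.0)])
cube = Part(2.0, [CrossSection("square", 1.0, 0.0),
                  CrossSection("square", 1.0, 2.0)])
# objects are then compositions (unions, intersections, subtractions)
# of such parts, as in the interactive program described above.
.end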
Planning: We have used models within the blocks world to predict the
scene to be analyzed for visual feedback: [Gill] and [Perkins]. With
respect to real world planning, Binford has supervised the thesis of
Garvey at SRI on visual strategies for office scenes.
.cb Execution Program for Visual Feedback
We will use a set of primitive operators which already exist (curve
matching, correlation, contours) to locate features from a model.
Humans have difficulty guiding assemblies without stereo. We will
use stereo to measure spatial misalignment. The world model will be
a graph whose links are perceptual operations and whose nodes are the
symbolic outputs of those operators. An abbreviated evaluation of
the model graph at runtime will allow alternatives for contingency.
Either the user or the planning system for visual feedback will
provide effective strategies.
For example, in the screw-in-hole problem, the hole will be located
by correlating with a small area centered on the hole in the training
image. It is costly to search by correlation for the hole over a
large portion of an image, and prone to error. The hole is small;
this is like searching for a needle in a haystack. Instead, a long
curve may be inexpensive to find. A scan with the edge operator
along a single line intersects the curve. A few additional edge
vectors confirm that the model curve predicts the curve which was
found. Once the curve is found, it is possible to predict the
location of the hole or inexpensively locate another feature to
predict the location of the hole. Then correlation provides a final
precise location.
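A Python sketch of this coarse-to-fine strategy follows. It is an
illustration with invented images and model offsets, not the planned
execution module: a cheap edge scan along one line finds the long
curve, the model predicts the hole from the curve, and correlation runs
only over a small window.
.bc;verbatim;
# Sketch: find a long curve cheaply, predict the hole from it, then
# correlate only in a small window around the prediction.

def scan_for_edge(image, row, threshold=5):
    for col in range(1, len(image[row])):
        if abs(image[row][col] - image[row][col - 1]) >= threshold:
            return col                   # cheap single-line edge scan
    return None

def correlate_window(image, template, center, half=2):
    best, best_score = None, float("inf")
    r0, c0 = center
    for r in range(r0 - half, r0 + half + 1):
        for c in range(c0 - half, c0 + half + 1):
            score = sum((image[r + i][c + j] - template[i][j]) ** 2
                        for i in range(len(template))
                        for j in range(len(template[0])))
            if score < best_score:
                best, best_score = (r, c), score
    return best

img = [[0] * 4 + [9] * 6 for _ in range(10)]   # vertical curve at col 4
img[4][7] = 0                                  # the hole, a dark spot
edge = scan_for_edge(img, 4)                   # coarse pass: column 4
hole = correlate_window(img, [[0]], (4, edge + 3))  # model: hole lies
print(edge, hole)                              # 3 columns right of curve
.end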
What are the limitations of our planned program for visual feedback?
.bc;
weak descriptors for texture
limited to small angular and position shifts; almost 2D.
.end
Descriptive primitives for texture are among the most important
problems. The work of Bajcsy [Bajcsy, Lieberman] is perhaps the best,
but enormous work remains to be done. The use of spatial features
obtained using stereo is common to all our planned vision work, and
works well with texture, although it is costly.
.cb Planning Program
It is possible to input the planning model in three different ways:
.bc;
manual 2D input
manual 3D input
automatic 2D and 3D input.
.end
The first is a manual 2D mode, with the user outlining region
features under keyboard control of a cursor, and the user making
associations between the 2D and 3D models. In the second mode, 3D
models will be input in our representation for objects. A 2D model
will be generated using computer graphics. In the third mode, a
program for automatic input from 2D images will characterize the
image in terms of stereo and region boundaries, and make association
with the 3D model. The first two exist in usable forms.
To generate execution programs, goals (locate hole, locate screw)
will be translated into a graph of model states linked by primitives.
A relaxation of the graph will determine an efficient subgraph to
attain the goals of locating parts.
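The selection step can be suggested with a small Python sketch in which
a cheapest-path search stands in for the relaxation described above;
the states, primitives, and costs are invented examples.
.bc;verbatim;
# Sketch: model states are nodes, perceptual primitives are links with
# estimated computation costs; a cheapest path from the initial state to
# the goal picks the execution strategy (cf. the screw-in-hole example).

import heapq

def cheapest_plan(graph, start, goal):
    heap, seen = [(0, start, [])], set()
    while heap:
        cost, state, ops = heapq.heappop(heap)
        if state == goal:
            return cost, ops
        if state in seen:
            continue
        seen.add(state)
        for op, nxt, c in graph.get(state, []):
            heapq.heappush(heap, (cost + c, nxt, ops + [op]))
    return None

graph = {
    "start":       [("correlate whole image", "hole located", 100),
                    ("edge scan one line",    "curve found",     5)],
    "curve found": [("predict, correlate locally", "hole located", 10)],
}
print(cheapest_plan(graph, "start", "hole located"))
# -> (15, ['edge scan one line', 'predict, correlate locally'])
.end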
.ss Planning Program Input Sub-objective
I. Automatically construct a 2D model of an image (screw in
distributor case) to simplify man-machine interface in visual
feedback programming.
II. Generate a 3D model of an assembly from a sample scene, with a
simple symbolic model as a guide.
.cb Accomplishments
Nevatia implemented a system which recognized a doll, a toy horse, a
glove, and several other objects from a structured visual memory [Nevatia]. The system
matched descriptions of objects against models of objects it had seen
before. The system made up its own models, which were descriptions
(sometimes modified by humans) of previously seen objects. For a
large visual memory, it is unreasonable to match an object
description against all models in memory. A beginning was made
toward recognition with a large visual memory, although only about
six models were used. Models were indexed according to summary
descriptions of the object shape. Only models with similar summary
descriptions were compared in detail.
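The indexing idea can be sketched in Python; the summary function, part
descriptions, and detailed matcher below are invented simplifications
of Nevatia's descriptors.
.bc;verbatim;
# Sketch: models are indexed by a coarse summary of shape, and an
# unknown object is compared in detail only against models with a
# similar summary, instead of against the whole visual memory.

def summary(desc):
    lengths = [p["length"] for p in desc["parts"]]
    return (len(lengths), "long" if max(lengths) > 10 else "short")

def detailed_match(a, b):
    return len(a["parts"]) == len(b["parts"]) and all(
        abs(pa["length"] - pb["length"]) < 2.0
        for pa, pb in zip(a["parts"], b["parts"]))

def recognize(unknown, memory):
    candidates = memory.get(summary(unknown), [])   # index, not full scan
    return [name for name, model in candidates
            if detailed_match(unknown, model)]

doll = {"parts": [{"length": 12}, {"length": 4}, {"length": 4},
                  {"length": 5}, {"length": 5}, {"length": 3}]}
memory = {summary(doll): [("doll", doll)]}
print(recognize(doll, memory))                      # ['doll']
.end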
Descriptions follow a representation of shape based on generalized
cone primitives [Binford]. In this representation, objects are
described in terms of parts. For example, a human has a body, four
armlike projections and another projection which isn't extended
(head). The primitive parts of the representation are armlike parts
defined by smoothly varying cross sections along a space curve
(generalized translational invariance).
Parts are described by the axis and cross section. The original data
is three-dimensional, although the description techniques are
valuable for TV data. Agin built a laser ranging system and
implemented a preliminary version of part description [Agin].
Nevatia's programs use depth data from the laser ranging system, find
boundaries of continuous surfaces, make descriptions of armlike parts
and piece together parts into complete descriptions [Nevatia].
.cb Plan
We plan to automate the building of the model used by the strategy
module in visual feedback. Now the programmer builds the model.
The final result of this will be that to program a visual feedback
task will require only putting down an example of an assembly and
supplying a task statement (put the sleeve over the shaft). The
process will be equally automated for objects for which
computer-aided design models are available.
To facilitate setup, stereo will make it possible to build space
models of parts to be assembled. This will allow us to extend the
capabilities of execution programs for visual feedback to assembly
with large position variations (10 cm) or large angular variations.
In imagining applications of vision to assembly, we immediately think
of very constrained, repetitive visual feedback tasks. However,
picking parts from a bin is a common industrial assembly subtask.
The work on visual feedback provides modules (description,
representation, strategies) which make possible more powerful visual
systems. We will use a combination of depth discontinuities from
stereo and/or laser ranging device, and color region boundaries. We
will describe parts of region boundaries using techniques from
Nevatia [Nevatia]. We will build up spatial descriptions from
boundary information and 2D cues. We will separate objects based on
spatial boundaries and additional segmentation at places where
objects touch (objects on the table, for example). We will recognize
objects from spatial descriptions of parts and confirm the
segmentation into objects.
Region boundary techniques are inadequate because they
threshold on a very local context. Some global optimal curve search
techniques have been developed [Martelli, Chien], but these are much
too special for our purposes, and computationally expensive.
Our plan is to limit computation cost by (1) decreasing the number of
possible curves by limiting to smooth curves, and (2) cascading sums
to form larger support. We will combine the output of two or more
Hueckel type operators [Hueckel] and threshold not on the Hueckel
disk, but over a larger support. We will analyze the theory of a
locally optimal curve technique, determine the computation cost, and
implement it if warranted.
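A Python sketch of a locally optimal smooth-curve search follows; the
column-by-column dynamic program, with the curve allowed to move at
most one row between columns, is our illustration of the approach, and
edge_strength stands in for combined Hueckel-type outputs.
.bc;verbatim;
# Sketch: dynamic programming for the best smooth curve across an image.
# Restricting the curve to move at most one row per column cuts the
# number of possible curves, and summing operator outputs along the
# whole curve gives the larger support a single operator disk lacks.

def best_smooth_curve(edge_strength):
    nrows, ncols = len(edge_strength), len(edge_strength[0])
    score = [[edge_strength[r][0]] + [0] * (ncols - 1)
             for r in range(nrows)]
    back = [[0] * ncols for _ in range(nrows)]
    for c in range(1, ncols):
        for r in range(nrows):
            prevs = [p for p in (r - 1, r, r + 1) if 0 <= p < nrows]
            best_p = max(prevs, key=lambda p: score[p][c - 1])
            score[r][c] = edge_strength[r][c] + score[best_p][c - 1]
            back[r][c] = best_p
    r = max(range(nrows), key=lambda r: score[r][ncols - 1])
    total, rows = score[r][ncols - 1], [r]
    for c in range(ncols - 1, 0, -1):
        r = back[r][c]
        rows.append(r)
    return total, rows[::-1]

strength = [[1, 0, 0, 0],
            [5, 6, 0, 1],
            [0, 1, 7, 8]]            # a bright, smoothly drifting curve
print(best_smooth_curve(strength))   # (26, [1, 1, 2, 2])
.end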
.cb Milestones
We will complete a system for verification vision including an
execution module, a 2D world model, and a planning system by January
1977. The execution system will include stereo. We will test the
system by visual control of insertion of a screw into a hole in a
distributor body by Jan 1976, and put a sleeve over a shaft by July 1976.
We will
have both feature-based and area-based stereo in use by July 1976.
We will implement the model graph and evaluation programs to provide
automatic generation of programs for screw in hole by January 1976.
We will have 3D input with 2D image input from graphics by January
1977.
We will have a vision language with 2D model and model graph
facilities by July 1976.
We will make the theoretical analysis of the optimal curve search,
and determine the computation cost by July 1976. We will have an
automatic 2D model generation program by July 1976. The basic higher
level perceptual program structure will be developed for the planning
program for visual feedback.
.ss Analysis and Modelling of Natural Scenes
We propose to continue research in computer analysis of natural
scenes such as rugged terrain, desert, roads, and streets with
emphasis on applications to vehicle guidance. A practical solution
to the problem of navigating a vehicle through an unknown environment
requires the ability to incrementally acquire models for newly
experienced parts of the world, to verify and refine previous world
models, and to detect obstacles and moving objects.
Of primary importance are geometric models of the world, which enable
the selection of navigation routes satisfying constraints such as
surface roughness and slopes.
Photometric properties such as color, reflectivity, and texture will
be used along with geometric models to provide better scene
description and segmentation.
.cb Approach
We propose to analyze multiple views of the world, both conventional
left-right stereo and motion parallax. Pairs of images along with
geometric models for the camera position and orientation are adequate
to generate a 3D model for the portions of the scene common to the
pairs of views.
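For the simplest camera geometry, the depth computation is a one-line
triangulation; the following Python sketch, with invented numbers,
shows the relation for parallel cameras a known baseline apart.
.bc;verbatim;
# Sketch: depth from a calibrated stereo pair with parallel optical
# axes. A feature at column xl in the left image and xr in the right
# has disparity d = xl - xr and depth Z = f * b / d, where f is the
# focal length in pixels and b the baseline. Numbers are invented.

def depth_from_disparity(xl, xr, focal_px, baseline_m):
    d = xl - xr
    if d <= 0:
        raise ValueError("zero or negative disparity: no depth")
    return focal_px * baseline_m / d

print(depth_from_disparity(xl=320, xr=308, focal_px=600, baseline_m=0.5))
# -> 25.0 (meters): the feature lies 25 m from the cameras
.end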
.cb Accomplishments
We have a substantial foundation of experience and
tools to build upon. Some of the major achievements are:
.bc;
a. Parallax region analyzer [Hannah];
b. Experimental automatic photogrammetry system [Quam];
c. Visual feedback cart servo program (Quam, Moravec);
d. GEOMED 3-dimensional modelling [Baumgart].
.end
.cb Types of Scenes and Images
We will digitize multiple views of a diverse collection of complex
natural scenes. Since this is primarily a geometric modelling
experiment, we will need measurements to define the location and
orientation of the camera for each view. These scenes are chosen to
cover a broad range of vehicle guidance environments. High
resolution is important. We would like at least 1200x800 pixel
digitization resolution, with 8 bits per primary color.
.ss Hardware plans
Our approach to hardware asks: how can we get the maximum research
return from a minimum of system and hardware effort? Some considerations
are:
.bs;fac;
(1) vision is limited by computer speed, not by input device
speed,
(2) we prefer to buy commercial devices and interfaces rather than
develop our own,
(3) it is adequate, usually preferable, to work from disk images
for 95% of our work,
(4) we do not really require computer control of all camera facilities
and can wait until we have firm immediate requirements before
worrying about that,
(5) it is extremely valuable to have color image output
.end
.hehard:
.cb Hand-held Camera
For visual feedback in assembly, it is planned to have a camera which
one hand can bring to the work area. We propose to purchase either a
small TV camera or a 100x100 solid state array from GE or Fairchild.
The solid state camera is much smaller. We may wait a year for
higher resolution solid state cameras. The price will be higher. A
unibus or mapping bus interface is adequate for these devices, which
have low data rate. The other considerations are mechanical and
computer interfacing of pan/tilt, etc. We will not do any pan/tilt,
since it can be just picked up by the hand. The cost, including
interface, will be about $6,000.
.cb Stereo Camera System
All of our vision projects rely extensively on stereo. A stable,
well-designed stereo configuration is needed which maintains
calibration.
The configuration depends upon a design study. We have not found a
suitable commercial system. The
cost will be about $15,000.
.bib
[Agin] Agin, Gerald J., Thomas O. Binford, "Computer Description of
Curved Objects", <Proc. Third International Joint
Conf. on Artificial Intelligence>, Stanford University, August
1973.
[Agin] Agin, G. J., "Representation and Description of Curved Objects" Stanford
Artificial Intelligence Project Memo No. 173, October 1972.
[Ambler] Ambler, A. P., H. G. Barrow, C. M. Brown, R. M. Burstall, R. J. Popplestone,
"A Versatile Computer-Controlled Assembly System", Dept. of Machine Intelligence,
University of Edinburgh.
[Ambler and Popplestone] A. P. Ambler and R. J. Popplestone, "Inferring
the Positions of Bodies from Specified Spatial Relationships", manuscript,
Dept. of Machine Intelligence, University of Edinburgh.
[Bajcsy] Bajcsy, Ruzena, "Computer Description of Textured Scenes", <Proc.
Third Int. Joint Conf. on Artificial Intelligence>, Stanford U.,
1973.
[Baumgart] Bruce G. Baumgart, "GEOMED - A Geometric Editor",
AIM-232, May 1974.
[Baumgart] Bruce G. Baumgart, "Geometric Modeling for Computer Vision",
AIM-249, October 1974.
[Binford] T.O. Binford, "Visual Perception by Computer", Invited paper
at <IEEE Systems Science and Cybernetics,> Miami, December 1971.
[Bolles] Bolles, R. C. and Paul, R., "The Use of Sensory Feedback in a
Programmable Assembly System", Stanford Artificial Intelligence Project
Memo No. 220, October 1973.
[Finkel]
Raphael Finkel, Russell Taylor, Robert Bolles, Richard Paul, Jerome Feldman,
"AL, A Programming System for Automation", AIM-243, November 1974.
[Gill] Gill, A., "Visual Feedback and Related Problems in Computer Controlled
Hand-Eye Coordination", Stanford Artificial Intelligence Project Memo
No. 178, October 1972.
[Goto] T. Goto, T. Inoyama and K. Takeyasu, "Precise Insert Operation by
Tactile Controlled Robot `HI-T-HAND EXPERT-2'", Proc. 4th International
Conference on Industrial Robots, p. 209, 1974.
[Hannah] Marsha Jo Hannah, "Computer Matching of Areas in Stereo Images",
<Ph.D. Thesis in Computer Science>, AIM-239, July 1974.
[Hueckel] M.H. Hueckel, "An Operator Which Locates Edges in Digitized
Pictures", AIM-105, December 1969; also in JACM, Vol. 18, No. 1, January 1971.
[Inoue] H. Inoue, "Force Feedback in Precise Assembly Tasks", MIT AI
Memo.
[Lieberman] Lawrence Lieberman, "Computer Recognition and Description of
Natural Scenes", PhD Dissertation, Univ. of Pennsylvania, 1974.
[Luckham] David Luckham and Jack Buchanan, "Automatic Generation of
Programs Containing Conditional Statements", <Proc.
A.I.S.B. Summer Conference,> Sussex, England, July 1974.
[Nakano] E. Nakano, X. Ozaki, T. Ishida, I. Kato, "Cooperational Control
of the Anthropomorphous Manipulator `MELARM'", Proc. 4th International
Conference on Industrial Robots, p. 251, 1974.
[Nevatia] R.K. Nevatia and T.O. Binford, "Structured Descriptions of
Complex Objects", <Third Int. Joint Conf. on AI>, Stanford, Calif, 1973.
[Paul] R. Paul, "Modelling, Trajectory Calculation and Servoing of a
Computer Controlled Arm", <Ph.D. Thesis in Computer Science,>
AIM-177, September 1972.
[Perkins] Walton A. Perkins, Thomas O. Binford,
"A Corner Finder for Visual Feedback", AIM-214, September 1973.
[Quam] Quam, Lynn H. "Computer Comparison of Pictures", Stanford Artificial
Intelligence Project Memo No. 144.
[Quam] Lynn Quam, Marsha Jo Hannah, "Stanford Automatic Photogrammetry Research",
AIM-254, November 1974.
[Sacerdoti] Earl D. Sacerdoti, "The Nonlinear Nature of Plans", Stanford
Research Institute Artificial Intelligence Group Technical Note 101,
January 1975.
[Scheinman] V. D. Scheinman, "Design of a Computer Manipulator",
Stanford Artificial Intelligence Project Memo AIM-92, June 1969.
[Yakimovsky] Yakimovsky, Y., "Scene Analysis using a Semantic Base for Region
Growing", Stanford Artificial Intelligence Project Memo No. 209.
[Yakimovsky] Yakimovsky, Y. and Feldman, J., "A Semantics-based Decision Theory
Region Analyzer", Proceedings of the Third International Joint
Conference on Artificial Intelligence, Stanford, 1973.
.cb FILMS
⊗ Richard Paul and Karl Pingle, "Instant Insanity", 16mm color, silent
6 min, August 1971.
⊗ Pingle, Paul and Bolles, "Automated Pump Assembly", 16mm color,
silent, 7 min, April 1973.
⊗ Pingle, Paul and Bolles, "Automated Assembly, Three Short Examples",
16mm color, sound, November 1974.
.end